A systematic study on parameter correlations in large-scale duplicate document detection
Similar resources
A systematic study on parameter correlations in large scale duplicate document detection
Although much work has been done on duplicate document detection (DDD) and its applications, we observe the absence of a systematic study on the performance and scalability of large-scale DDD algorithms. It is still unclear how various parameters in DDD correlate mutually, such as similarity threshold, precision/recall requirement, sampling ratio, and document size. This paper explores the corr...
A systematic study of parameter correlations in large scale duplicate document detection
Although much work has been done on duplicate document detection (DDD) and its applications, we observe the absence of a systematic study of the performance and scalability of large-scale DDD. It is still unclear how various parameters of DDD, such as similarity threshold, precision/recall requirement, sampling ratio, document size, correlate mutually. In this paper, correlations among several ...
Duplicate Image Detection in Large Scale Databases
We propose an image duplicate detection method for identifying modified copies of the same image in a very large database. Modifications that we consider include rotation, scaling and cropping. A compact 12 dimensional descriptor based on Fourier Mellin Transform is introduced. The compactness of this descriptor allows efficient indexing over the entire database. Results are presented on a 10 m...
Duplicate document detection
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document ...
Near Duplicate Document Detection for Large Information Flows
Near duplicate documents and their detection are studied to identify info items that convey the same (or very similar) content, possibly surrounded by diverse sets of side information like metadata, advertisements, timestamps, web presentations and navigation supports, and so on. Identification of near duplicate information allows the implementation of selection policies aiming to optimize an i...
Journal
Journal title: Knowledge and Information Systems
Year: 2007
ISSN: 0219-1377, 0219-3116
DOI: 10.1007/s10115-007-0071-9